Background / Context: 

Description of prior research and its intellectual context. 

Randomized experiments are generally considered to provide the strongest basis for 
causal inference. Consequently, randomized field trials have been 
increasingly used to evaluate the effects of education interventions, products, and services. 
Populations of interest in education are often hierarchically structured (such as students nested 
within classrooms, nested within schools, nested within school districts). The sampling designs 
used in educational experiments often exploit this hierarchical structure by sampling entire intact 
units (such as schools or classrooms). The designs most frequently used in education field 
experiments are variants of two designs: the hierarchical design and the (generalized) randomized 
block design (see Spybrook and Raudenbush, 2009). 

The hierarchical design assigns entire intact groups (such as schools) to treatments, so 
that every individual (or lower level hierarchical unit) in the group receives the same treatment. 
Because the intact groups assigned to treatments are statistical clusters (in the sense of sampling 
theory) the hierarchical design is often called the cluster-randomized design. For example, an 
experiment using the hierarchical design might assign entire schools (and all the classrooms and 
students within them) to receive the same treatment. 

In the randomized block design, individuals (or lower level units) are assigned to 
treatments within intact groups such as schools. Because the intact groups within which 
treatment assignment takes place are often geographical sites, this design is often called the 
multisite design. It is also sometimes called the matched design because the units assigned to 
different treatments occur within the same higher level intact unit and are therefore matched by 
virtue of being in that same higher level unit. For example, an experiment using the randomized 
block design might involve several schools, but assign different individuals within each school to 
different treatments. Alternatively, it might assign entire classrooms (and all of the 
individuals within those classrooms) to different treatments within each school. 

A crucial aspect of designing field experiments is the assessment of the statistical power 
of the test for treatment effects so that the investigator can be sure that the design has adequate 
sensitivity to detect the smallest treatment effects that are judged to be important. The statistical 
power of hierarchical designs depends on the effect size, the significance level, the sample size at 
each level of the design (e.g., the number of clusters and the number of individuals in each 
cluster), the way that the outcome variance is distributed across levels of the design (usually 
summarized by the intraclass correlation structure), and if covariates are used, the effectiveness 
of the covariates in explaining variation at each level of the design (usually summarized by 
variance explained or $R^2$ statistics at each level of the design). 
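
As a rough illustration of how these quantities combine, the following Python sketch approximates the power of a two-sided test in a balanced two-arm hierarchical design without covariates. It is not taken from the studies cited here: the function name `approx_power_crt`, its arguments, and the use of a normal approximation (exact calculations use the noncentral t distribution) are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power_crt(delta, rho, m, n, alpha=0.05):
    """Approximate two-sided power for a balanced two-arm cluster-randomized design.

    delta : effect size in total standard deviation units
    rho   : intraclass correlation
    m     : total number of clusters (half assigned to each arm)
    n     : number of individuals per cluster
    Assumption: normal approximation; exact calculations use a noncentral t.
    """
    # With the total variance scaled to 1, each arm mean has variance
    # (rho + (1 - rho) / n) / (m / 2), so the difference has 4 * (...) / m.
    se = sqrt(4.0 * (rho + (1.0 - rho) / n) / m)
    z = NormalDist()
    crit = z.inv_cdf(1.0 - alpha / 2.0)
    lam = delta / se
    # Probability of rejecting in either tail.
    return z.cdf(lam - crit) + z.cdf(-lam - crit)


if __name__ == "__main__":
    for rho in (0.05, 0.15, 0.25):
        print(rho, round(approx_power_crt(0.25, rho, m=40, n=20), 3))
```

Holding the sample sizes fixed, the approximate power falls as the intraclass correlation rises, which is why uncertainty about that parameter matters at the planning stage.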

While the significance level and the sample sizes are under the control of the investigator, 
the intraclass correlation structure and the effectiveness of the covariates are not. Moreover, 
these parameters are often difficult to know precisely before the experiment has been carried out. 
Because of their importance in planning experiments, a literature has emerged that provides an 
empirical grounding for intraclass correlations and covariate values. These values come from 
experiments that have already been conducted (e.g., Schochet, 2008), large urban school districts 
(e.g., Bloom et al., 2007), sample surveys (e.g., Hedges and Hedberg, 2007), or state 
longitudinal data systems (e.g., Hedges and Hedberg, 2014). Such reports generally provide 
estimates of the design parameters and an uncertainty (standard error) of those estimates. 

Although intraclass correlation was originally introduced in relation to two level 
sampling models, the concept extends naturally to sampling models with three or more levels. 
The intraclass correlation concept in cases of three and four level sampling models is of great 
interest in the design and analysis of experiments in education (Hedges and Rhoads, 2010; 
Konstantopoulos, 2008a, 2008b, 2009). 

Consequently, there has been considerable interest in estimating intraclass 
correlations from sample surveys that use multistage samples 
(e.g., Hedges & Hedberg, 2007). Such studies typically fit unconditional multilevel models to 
the survey data to estimate variance components at each level of the sampling design. Other 
studies use datasets that were assembled in the course of carrying out randomized experiments 
(Bloom et al., 2007). While the surveys often have large total sample sizes, the number of 
sampled units at some levels (typically the higher levels) may not be large enough to make 
negligible the sampling uncertainty of estimates of variance components and functions of 
variance components (such as intraclass correlations). Even experiments that are normatively 
large (that is, large for experiments) typically involve much smaller sets of schools than national 
surveys (typically fewer than 100 schools). Consequently, for either type of study designed to 
provide reference values of intraclass correlations, it is important to provide some assessment of 
the uncertainty of the estimates. However, because the sample sizes in surveys are actually large, 
estimates of sampling uncertainty based on large sample methods should be accurate enough to 
give this guidance. 

In the case of experiments with randomized block designs, statistical power depends on 
an additional parameter: the variation of treatment effects across higher-level units. Different 
investigators have used several different variants of this parameter, but there has been little 
research on the estimation of these treatment effect heterogeneity parameters. 

Purpose / Objective / Research Question / Focus of Study: 

The purpose of this paper is to consider estimators of the treatment effect heterogeneity 
parameters and their sampling distributions. In addition, the estimation and sampling 
distributions of intraclass correlations are reviewed. This research should be of interest to 
investigators who wish to report empirical estimates of treatment effect heterogeneity parameters 
(and their uncertainties) from individual experiments or larger datasets. 

Significance / Novelty of study: 

The literature on estimation of intraclass correlations has largely been 
restricted to the case of two level models. We review this literature. Little work, however, has 
been done to outline the variance of heterogeneity parameters. This paper brings to the literature 
new formulas for the estimation of the variance of heterogeneity parameters. 

Statistical, Measurement, or Econometric Model: 

Variance of ICC for Two Level Model 

In a two level model there are two variance components, $\sigma_1^2$ and $\sigma_2^2$, which are estimated 
by $s_1^2$ and $s_2^2$, respectively, where $s_1^2 \approx \sigma_1^2$. Let $v_1$ and $v_2$ be the variances of $s_1^2$ and $s_2^2$. The 
condition $s_1^2 \approx \sigma_1^2$ implies that $v_1 = 0$. Let $m$ denote the number of clusters (level 2 units) and $n_i$ 
denote the number of level 1 units in the $i$th level 2 unit. When the design is balanced, 
$n_1 = \cdots = n_m = n$. The intraclass correlation in the two level model is 

$$\rho = \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2},$$

which is estimated by 

$$r = \frac{s_2^2}{s_1^2 + s_2^2}.$$

Then the large sample variance of $r$, the estimate of $\rho$, in a balanced design is 

$$v_r = \frac{(1-\rho)^2\, v_2}{\sigma_T^4}, \tag{1}$$

where $v_2$ is the variance of $s_2^2$, the sample estimate of the level 2 variance component, and 
$\sigma_T^2 = \sigma_1^2 + \sigma_2^2$ is the total variance. The sample estimate of the variance of $r$ is obtained by replacing 
all of the parameters in (1) by their sample estimates, that is, the estimate of the variance of $r$ is 

$$\hat{v}_r = \frac{(1-r)^2\, \hat{v}_2}{(s_1^2 + s_2^2)^2}.$$

The standard error of $r$ is just the square root of its estimated variance. 
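
The plug-in calculation can be sketched in a few lines of Python. This is an illustrative sketch, not the ICCVAR implementation: the function name `icc_with_se` and its arguments are hypothetical, and it assumes the large sample variance in (1) has the form $(1-\rho)^2 v_2 / \sigma_T^4$, with the level 1 variance component treated as known.

```python
from math import sqrt


def icc_with_se(s1_sq, s2_sq, v2):
    """Return (r, standard error of r) for a balanced two level design.

    s1_sq : estimated level 1 variance component (treated as known, v1 = 0)
    s2_sq : estimated level 2 variance component
    v2    : estimated sampling variance of s2_sq
    """
    total = s1_sq + s2_sq                      # estimate of the total variance
    r = s2_sq / total                          # estimated intraclass correlation
    var_r = (1.0 - r) ** 2 * v2 / total ** 2   # plug-in version of (1)
    return r, sqrt(var_r)


if __name__ == "__main__":
    r, se = icc_with_se(s1_sq=3.0, s2_sq=1.0, v2=0.04)
    print(r, se)
```

The variance components and the variance of the level 2 component would normally come from the output of a fitted unconditional multilevel model.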

Variance of Heterogeneity Parameters for Two Level Block Randomized Design 

Suppose that there are $m$ level 2 units (blocks or clusters such as schools) and there are 
$n$ level 1 units (individuals) in each level 2 unit. Let $Y_{ij}$ be the outcome score for the $j$th level 1 
unit in the $i$th level 2 unit. The level 1 model is 

$$Y_{ij} = \beta_{0i} + \beta_{1i} T_{ij} + \varepsilon_{ij}, \quad i = 1, \ldots, m;\ j = 1, \ldots, n, \tag{2}$$

where $\beta_{0i}$ is the mean and $\beta_{1i}$ is the treatment effect in the $i$th level 2 unit, $T_{ij}$ is a treatment 
indicator, and $\varepsilon_{ij}$ is a normally distributed level 1 residual with mean zero and variance $\sigma_1^2$. The 
level 2 model is 

$$\beta_{0i} = \gamma_0 + \eta_{0i}, \qquad \beta_{1i} = \gamma_1 + \eta_{1i}, \quad i = 1, \ldots, m, \tag{3}$$

where $\gamma_0$ is the grand mean, $\gamma_1$ is the mean treatment effect, $\eta_{0i}$ is a normally distributed level 2 
residual with mean zero and variance $\sigma_2^2$, and $\eta_{1i}$ is a normally distributed level 2 residual with 
mean zero and variance $\tau$. 
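
To make the two level block model concrete, the following Python sketch draws one dataset from models (2) and (3). It is an illustration only: the function name `simulate_block_design`, the 0/1 coding of the treatment indicator, the equal split of each block between treatments, and the independence of the two level 2 residuals are assumptions of the sketch, not statements from the paper.

```python
import random


def simulate_block_design(m, n, gamma0, gamma1, sigma1_sq, sigma2_sq, tau, seed=0):
    """Draw outcomes Y[i][j] and treatment indicators T[i][j] from models (2)-(3).

    Assumptions for this sketch: n is even, half of each block is treated
    (T = 1) and half is untreated (T = 0), and the two level 2 residuals
    are drawn independently.
    """
    rng = random.Random(seed)
    Y, T = [], []
    for _ in range(m):
        beta0 = gamma0 + rng.gauss(0.0, sigma2_sq ** 0.5)   # block mean
        beta1 = gamma1 + rng.gauss(0.0, tau ** 0.5)         # block treatment effect
        t_i = [1] * (n // 2) + [0] * (n // 2)
        rng.shuffle(t_i)                                    # randomize within block
        y_i = [beta0 + beta1 * t + rng.gauss(0.0, sigma1_sq ** 0.5) for t in t_i]
        Y.append(y_i)
        T.append(t_i)
    return Y, T


if __name__ == "__main__":
    Y, T = simulate_block_design(m=20, n=10, gamma0=0.0, gamma1=0.3,
                                 sigma1_sq=0.8, sigma2_sq=0.2, tau=0.05)
    print(len(Y), len(Y[0]))
```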

Statistical power in randomized block designs depends on the significance level and the 
sample sizes at each level, but also on the effect size, the intraclass correlation, and the treatment 
effect heterogeneity, although the precise relation depends on the way the treatment effect size 
and treatment effect heterogeneity are expressed. 

The effect size can be defined as 

$$\delta_T = \frac{\gamma_1}{\sqrt{\sigma_1^2 + \sigma_2^2}} \tag{4}$$

or as 

$$\delta_W = \frac{\gamma_1}{\sigma_1} \tag{5}$$

(see Hedges, 2007). 

Note that the effect size and the intraclass correlation are scale free: $\rho$ is 
defined as a proportion of the total variance of an untreated population and therefore (like the 
effect size $\delta$) does not depend on the scale of the outcome variable. 

The most direct parameter used to express the treatment effect heterogeneity is the 
treatment effect variance (the treatment by level 2 unit interaction variance component) $\tau$. For 
the purposes of providing reference values that might apply across experiments, $\tau$ has the 
disadvantage that it is not scale free: it depends on the scale of the outcome variable. 
Consequently, discussions of statistical power in randomized block experiments have usually 
relied on transformations of $\tau$ that are scale free. One of these has been called the effect size 
variance 

$$\mathrm{ESV} = \tau/\sigma_1^2, \tag{6}$$

which corresponds to the variance of the effect sizes that would be computed if each level 2 unit 
were a separate experiment, that is, the variance of the $\delta_i = \beta_{1i}/\sigma_1$ values (see Spybrook et al., 
2006). Another parameter that has been used is 

$$\omega = \tau/\sigma_2^2, \tag{7}$$

which is the ratio of treatment effect variance to the variance of level 2 units (Hedges and 
Rhoads, 2010; Hedges and Borenstein, 2014). 
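
The relations among these parameters can be collected in a short Python sketch that converts the raw model parameters into the scale-free quantities defined in (4) through (7). The function name `scale_free_parameters` and the dictionary keys are hypothetical names chosen for illustration.

```python
from math import sqrt


def scale_free_parameters(gamma1, sigma1_sq, sigma2_sq, tau):
    """Convert raw model parameters to the scale-free quantities in (4)-(7)."""
    total = sigma1_sq + sigma2_sq
    return {
        "rho": sigma2_sq / total,              # intraclass correlation
        "delta_T": gamma1 / sqrt(total),       # effect size (4)
        "delta_W": gamma1 / sqrt(sigma1_sq),   # effect size (5)
        "ESV": tau / sigma1_sq,                # effect size variance (6)
        "omega": tau / sigma2_sq,              # heterogeneity ratio (7)
    }


if __name__ == "__main__":
    original = scale_free_parameters(0.5, 0.75, 0.25, 0.05)
    # Multiplying the outcome by 10 multiplies gamma1 by 10 and every variance by 100.
    rescaled = scale_free_parameters(5.0, 75.0, 25.0, 5.0)
    print(original)
    print(rescaled)
```

Rescaling the outcome variable leaves all five quantities unchanged, whereas $\tau$ itself changes by the square of the scaling factor.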

If the number of units receiving each treatment is identical within every level 2 unit, the 
design is balanced. In balanced designs, there are simple presentations of the treatment 
heterogeneity indices in terms of F-statistics from the treatment by level 2 unit analysis of 
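variance. The analysis of variance F-statistic for the treatment by level 2 unit interaction ($F_{AB}$) 
has a sampling distribution that is $(n\tau/2 + \sigma_1^2)/\sigma_1^2$ times a central F random variable with $(m - 1)$ 
and $(mn - 2m)$ degrees of freedom (see, e.g., Searle, 1971). Because the expected value of this 
central F is $(mn - 2m)/(mn - 2m - 2)$, it follows that the expected value of $F_{AB}$ is 

$$E(F_{AB}) = \left(\frac{n\tau/2 + \sigma_1^2}{\sigma_1^2}\right)\frac{mn - 2m}{mn - 2m - 2},$$

and it follows that 

$$\widehat{\mathrm{ESV}} = \frac{2}{n}\left[\left(\frac{mn - 2m - 2}{mn - 2m}\right)F_{AB} - 1\right] \tag{8}$$

is an unbiased estimator of ESV. Because the variance of the relevant central F is 

$$\frac{2(mn-2m)^2\,(mn-m-3)}{(m-1)(mn-2m-2)^2\,(mn-2m-4)},$$

it follows that the variance of (8) is 

$$\frac{\left(n^2\tau^2 + 4n\tau\sigma_1^2 + 4\sigma_1^4\right)\left[2(mn-m-3)\right]}{n^2\sigma_1^4\,(m-1)(mn-2m-4)}
= \frac{(n\,\mathrm{ESV} + 2)^2\left[2m(n-1)-6\right]}{n^2\,(m-1)(mn-2m-4)}. \tag{9}$$
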
The estimator of $\omega$ is obtained from the variance component estimates 

$$\hat{\omega} = \frac{2(MS_{AB} - MS_E)}{MS_B - MS_E} = \frac{2(F_{AB} - 1)}{F_B - 1}, \tag{10}$$

where $MS_{AB}$, $MS_B$, and $MS_E$ are the treatment by level 2 interaction, level 2 main 
effect, and within cell (error) mean squares, and $F_{AB}$ and $F_B$ are the treatment by level 2 
interaction and level 2 main effect F-test statistics from the two factor analysis of variance. 
Although the numerator and denominator of (10), each divided by $n$, are unbiased estimates of $\tau$ and $\sigma_2^2$, 
respectively, the ratio is not an unbiased estimator of $\tau/\sigma_2^2$, but generally estimates a quantity 
that is larger than $\omega$. The variance of (10) is approximately 


$$\frac{8ap^2}{n^2} + \left[\frac{8p}{n(m-1)} - \frac{8p^2}{n^2 m(n-2)}\right]\omega + \left[\frac{4}{m-1} + \frac{4p}{n(m-1)} + \frac{2ap^2}{n^2}\right]\omega^2, \tag{11}$$

where 

$$a = \frac{mn - m - 1}{m(n-2)(m-1)}$$

and $p = (1-\rho)/\rho$. Obviously, if we are estimating $\omega$, neither $\omega$ nor $\rho$ will be known, so we 
will have to substitute estimates of $\omega$ and $\rho$ for the parameter values in (11), which results in a 
consistent estimator of the variance. 
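
The point estimator in (10) can be computed either from the mean squares or from the F-statistics, as the following Python sketch shows. The function names are hypothetical, and the sketch covers only the point estimate, not its variance.

```python
def omega_from_mean_squares(ms_ab, ms_b, ms_e):
    """Estimate omega from the two factor analysis of variance mean squares, as in (10)."""
    return 2.0 * (ms_ab - ms_e) / (ms_b - ms_e)


def omega_from_f(f_ab, f_b):
    """Same estimator written in terms of F_AB = MSAB / MSE and F_B = MSB / MSE."""
    return 2.0 * (f_ab - 1.0) / (f_b - 1.0)


if __name__ == "__main__":
    ms_ab, ms_b, ms_e = 1.5, 6.0, 1.0
    print(omega_from_mean_squares(ms_ab, ms_b, ms_e))
    print(omega_from_f(ms_ab / ms_e, ms_b / ms_e))
```

Both forms give the same value because dividing the numerator and denominator by the error mean square leaves the ratio unchanged.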


Usefulness / Applicability of Method: 

Stata programs to estimate the variance of ICCs are available in the package ICCVAR (to install, 
type “ssc install iccvar” at the Stata command prompt). Code for estimating the variance of 
heterogeneity parameters is also being developed. 

Conclusions: 

This work allows the growing community of design scientists to estimate variances of important 
parameters so that users of compendiums are aware of the sampling variability of their 
recommendations. 

One limitation of the methods presented here is that they assume a balanced design. We are 
working on derivations that allow for unbalanced designs and hope to have them ready for the 
conference. 